RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions
Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
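The XParaCon metric described above can be sketched in a few lines. The function name and input layout here are illustrative assumptions; the abstract only specifies that robustness is quantified as the standard deviation of accuracies across paraphrased question variants (so a lower value means more paraphrase-robust behavior).

```python
from statistics import pstdev

def xparacon(accuracies_per_variant):
    """Illustrative sketch of XParaCon: the standard deviation of a
    model's accuracies over paraphrased variants of a question set.
    Population standard deviation is one reasonable choice; the
    abstract does not specify which estimator is used.
    """
    return pstdev(accuracies_per_variant)

# A model that scores identically on every paraphrase variant is
# perfectly consistent under this metric:
print(xparacon([0.8, 0.8, 0.8]))  # 0.0
```

A model scoring 0.5 on one variant and 0.9 on another would get an XParaCon of 0.2 under this sketch, reflecting its inconsistency.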
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
He, Jiayi, Huang, Yangmin, Du, Qianyun, Zhou, Xiangying, He, Zhiyang, Hu, Jiaxue, Tao, Xiaodong, Lai, Lixian
Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory compliance. We introduce MedFact, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances drawn from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. The benchmark is constructed with a hybrid AI-human framework in which iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization; results show that models can often determine whether a text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals the "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which can be exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources for developing factually reliable medical AI systems.
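The "over-criticism" phenomenon above can be operationalized as a false-positive rate on factually correct instances. This is a minimal sketch under our own assumptions; the record keys and the exact measurement used by MedFact are hypothetical here.

```python
def over_criticism_rate(examples):
    """Sketch of measuring 'over-criticism': the fraction of factually
    correct instances that a model nonetheless flags as erroneous.
    The dict keys ('is_correct', 'model_flagged_error') are illustrative,
    not the MedFact schema.
    """
    correct = [ex for ex in examples if ex["is_correct"]]
    if not correct:
        return 0.0
    flagged = sum(1 for ex in correct if ex["model_flagged_error"])
    return flagged / len(correct)

data = [
    {"is_correct": True,  "model_flagged_error": True},   # over-criticism
    {"is_correct": True,  "model_flagged_error": False},  # correctly accepted
    {"is_correct": False, "model_flagged_error": True},   # true positive; ignored here
]
print(over_criticism_rate(data))  # 0.5
```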
Decoupled Variational Gaussian Inference
Variational Gaussian (VG) inference methods that optimize a lower bound to the marginal likelihood are a popular approach for Bayesian inference. These methods are fast and easy to use, while being reasonably accurate. A difficulty remains in computing the lower bound when the latent dimensionality $L$ is large. Even though the lower bound is concave for many models, its computation requires optimization over $O(L^2)$ variational parameters. Efficient reparameterization schemes can reduce the number of parameters, but they give inaccurate solutions or destroy concavity, leading to slow convergence.
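The $O(L^2)$ parameter count mentioned above can be made concrete with the standard VG bound; the notation here ($y$, $z$, $m$, $V$, $\mu$, $\Sigma$) is ours, not necessarily the paper's.

```latex
% With a Gaussian posterior approximation q(z) = N(z | m, V) and a
% Gaussian prior p(z) = N(z | \mu, \Sigma), the VG lower bound on the
% log marginal likelihood takes the standard form
\mathcal{L}(m, V)
  = \mathbb{E}_{q(z)}\!\left[\log p(y \mid z)\right]
  - \mathrm{KL}\!\left(\mathcal{N}(m, V)\,\big\|\,\mathcal{N}(\mu, \Sigma)\right).
% The covariance V is a symmetric L x L matrix, so optimizing the bound
% directly involves L(L+1)/2, i.e. O(L^2), variational parameters.
```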
IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
Wang, Qiyao, Chen, Guhong, Wang, Hongbo, Liu, Huaren, Zhu, Minghui, Qin, Zhifei, Li, Linwei, Yue, Yilin, Wang, Shiqiang, Li, Jiayan, Wu, Yihang, Liu, Ziqiang, Chen, Longze, Luo, Run, Fan, Liyang, Li, Jiaming, Zhang, Lei, Xu, Kan, Li, Chengming, Alinejad-Rokny, Hamid, Ni, Shiwen, Lin, Yuan, Yang, Min
Intellectual Property (IP) is a highly specialized domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. Recent advancements in LLMs have demonstrated their potential to handle IP-related tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce IPBench, the first comprehensive IP task taxonomy and a large-scale bilingual benchmark encompassing 8 IP mechanisms and 20 distinct tasks, designed to evaluate LLMs in real-world IP scenarios. We benchmark 17 mainstream LLMs, ranging from general-purpose to domain-specific, including chat-oriented and reasoning-focused models, under zero-shot, few-shot, and chain-of-thought settings. Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. To foster future research, we publicly release IPBench and will expand it with additional tasks to better reflect real-world complexities and support model advancements in the IP domain. We provide the data and code in the supplementary URLs.
Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers
Language models can capture complex relationships in a given text, but they are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Moreover, the resources invested in producing this information are wasted if it is incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this algorithm, a language model's response to a prompt about a given text is accepted as correct only if it appears in the collection of possible (i.e., candidate) answers; text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated across a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe the novel LMC algorithm contributes significantly to the knowledge-extraction field and merits much further exploration.
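The cascade described above can be sketched as follows. The interface (a list of model callables, fastest first, plus a candidate-answer set) is an illustrative assumption, not the authors' implementation.

```python
def lmc_extract(text, prompt, models, candidates):
    """Minimal sketch of the Language Model Chain (LMC) idea: query
    models from fastest to most predictive, accept a response only if
    it is one of the known candidate answers, and fall through to the
    next (slower) model otherwise.
    """
    for model in models:
        answer = model(text, prompt)
        if answer in candidates:   # response is verifiably a valid candidate
            return answer
    return None                    # no model produced a candidate answer

# Toy usage: a 'fast' model that hallucinates and a 'slow' one that succeeds.
fast = lambda text, prompt: "1899-13-40"   # not in the candidate set
slow = lambda text, prompt: "1954-06-07"
print(lmc_extract("DOB: June 7, 1954", "Extract the date of birth.",
                  [fast, slow], candidates={"1954-06-07", "1960-01-01"}))
```

Because incorrect responses are filtered against the candidate set before being accepted, the fast model handles easy cases cheaply while hard cases escalate, which matches the speed/accuracy trade-off the abstract reports.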
SocialEval: Evaluating Social Intelligence of Large Language Models
Zhou, Jinfeng, Chen, Yuxuan, Shi, Yihan, Zhang, Xuanming, Lei, Leqi, Feng, Yi, Xiong, Zexuan, Yan, Miao, Wang, Xunzhi, Cao, Yaru, Yin, Jianing, Wang, Shuai, Dai, Quanyu, Dong, Zhenhua, Wang, Hongning, Huang, Minlie
LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs' SI and its discrepancy from that of humans. SI equips humans with the interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This suggests an operational evaluation paradigm, combining outcome-oriented goal-achievement evaluation with process-oriented interpersonal-ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark that integrates outcome- and process-oriented evaluation through manually crafted narrative scripts. Each script is structured as a world tree containing plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even when these lead to goal failure. Analysis of LLMs' formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Wang, Weiyun, Gao, Zhangwei, Chen, Lianjie, Chen, Zhe, Zhu, Jinguo, Zhao, Xiangyu, Liu, Yangzhou, Cao, Yue, Ye, Shenglong, Zhu, Xizhou, Lu, Lewei, Duan, Haodong, Qiao, Yu, Dai, Jifeng, Wang, Wenhai
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families under Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model outperforms Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset, VisualPRM400K, using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released at https://internvl.github.io/blog/2025-03-13-VisualPRM/.
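Best-of-N selection with a process reward model, as used in the evaluation above, can be sketched as follows. Averaging per-step scores is one common aggregation choice; the abstract does not specify VisualPRM's exact aggregation, so treat this as an assumption.

```python
def best_of_n(responses, prm_score):
    """Sketch of Best-of-N selection with a Process Reward Model:
    score each sampled reasoning chain step by step with the PRM and
    keep the response whose steps score highest on average.
    """
    def chain_score(steps):
        scores = [prm_score(step) for step in steps]
        return sum(scores) / len(scores)
    return max(responses, key=chain_score)

# Toy PRM: rewards steps that carry the unit conversion through.
toy_prm = lambda step: 1.0 if "cm" in step else 0.2
chains = [
    ["convert 2 m to cm", "2 m = 200 cm", "answer: 200 cm"],
    ["skip the conversion", "answer: 2"],
]
print(best_of_n(chains, toy_prm))  # selects the first chain
```

This also illustrates why a PRM can beat an outcome reward model in BoN: the PRM penalizes the second chain for its flawed intermediate step, not just its final answer.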
Reviews: Identification of Gaussian Process State Space Models
The authors derive a variational objective for inference and hyperparameter learning in a GPSSM. The authors apply a mean field variational approximation to the distribution over inducing points and a Gaussian approximation with Markov structure to the distribution over the sequence of latent states. The parameters of the latter depend on a bi-RNN. The variational bound is optimised using doubly stochastic gradient optimisation. The authors apply their algorithm to three simulated data examples, showing that particular applications may require the ability to flexibly choose kernel functions and that the algorithm recovers meaningful structure in the latent states.